Introduction

SMDX data

SDMX, which stands for Statistical Data and Metadata eXchange is an international initiative that aims at standardising and modernising (“industrialising”) the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries

Eurostat, OECD, UN, World Bank, IMF and many others use SDMX in order to transit their datasets.

Using R one can handle these datasets either with the rsdmx library or with specific APIs, like the eurostat library: https://ropengov.github.io/eurostat/

Data download

Every dataset from Eurostat can be defined by a unique name. We just need to provide this name to the API, for example:

## Table tin00028 cached at /tmp/RtmpOTDlji/eurostat/tin00028_date_code_TF.rds
## # A tibble: 1,712 x 6
##    ind_type  unit   indic_is geo   time       values
##    <fct>     <fct>  <fct>    <fct> <date>      <dbl>
##  1 IND_TOTAL PC_IND I_ILT12  AT    2007-01-01     69
##  2 IND_TOTAL PC_IND I_ILT12  BE    2007-01-01     69
##  3 IND_TOTAL PC_IND I_ILT12  BG    2007-01-01     34
##  4 IND_TOTAL PC_IND I_ILT12  CY    2007-01-01     41
##  5 IND_TOTAL PC_IND I_ILT12  CZ    2007-01-01     52
##  6 IND_TOTAL PC_IND I_ILT12  DE    2007-01-01     75
##  7 IND_TOTAL PC_IND I_ILT12  DK    2007-01-01     85
##  8 IND_TOTAL PC_IND I_ILT12  EA    2007-01-01     61
##  9 IND_TOTAL PC_IND I_ILT12  EE    2007-01-01     66
## 10 IND_TOTAL PC_IND I_ILT12  EL    2007-01-01     36
## # … with 1,702 more rows

It is a good idea, especially if we work with big datasets, to download and save locally the files. For example:

Plotly

plotly is js library for interactive graphics. Here is an example how easly can be used:

How data look like

eurostat API returns data as data frames (tibbles in tidyverse terminology) with the SMDX format preserved. Here is how data look like:

## # A tibble: 1,712 x 6
##    ind_type  unit   indic_is geo   time       values
##    <chr>     <chr>  <chr>    <chr> <date>      <dbl>
##  1 IND_TOTAL PC_IND I_ILT12  AT    2007-01-01     69
##  2 IND_TOTAL PC_IND I_ILT12  BE    2007-01-01     69
##  3 IND_TOTAL PC_IND I_ILT12  BG    2007-01-01     34
##  4 IND_TOTAL PC_IND I_ILT12  CY    2007-01-01     41
##  5 IND_TOTAL PC_IND I_ILT12  CZ    2007-01-01     52
##  6 IND_TOTAL PC_IND I_ILT12  DE    2007-01-01     75
##  7 IND_TOTAL PC_IND I_ILT12  DK    2007-01-01     85
##  8 IND_TOTAL PC_IND I_ILT12  EA    2007-01-01     61
##  9 IND_TOTAL PC_IND I_ILT12  EE    2007-01-01     66
## 10 IND_TOTAL PC_IND I_ILT12  EL    2007-01-01     36
## # … with 1,702 more rows
## [1] "tbl_df"     "tbl"        "data.frame"
## Classes 'tbl_df', 'tbl' and 'data.frame':    1712 obs. of  6 variables:
##  $ ind_type: chr  "IND_TOTAL" "IND_TOTAL" "IND_TOTAL" "IND_TOTAL" ...
##  $ unit    : chr  "PC_IND" "PC_IND" "PC_IND" "PC_IND" ...
##  $ indic_is: chr  "I_ILT12" "I_ILT12" "I_ILT12" "I_ILT12" ...
##  $ geo     : chr  "AT" "BE" "BG" "CY" ...
##  $ time    : Date, format: "2007-01-01" "2007-01-01" ...
##  $ values  : num  69 69 34 41 52 75 85 61 66 36 ...

More pretty table

Data display in the console is not always enough. A better and more efficiency way to explore data is to use the DT::datatable function. Here is an example:

Eurostat distributed datasets

Eurostat has to collect, process and distribute:

  • values for something (for example internet usage)
  • for several countries or regions of countries
  • for several times (yearly, quertely or monthly data)

With a common and stable format

  • as time passes by
  • as new countries enter
  • some countries might go

Eurostat web interface

Eurostat: Internet use by individuals

https://ec.europa.eu/eurostat/data/database?node_code=tin00028

Eurostat web interface (2)

Eurostat: Internet usage by individuals

https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tin00028

Eurostat Excel files

Disadvantages

  • Dirty data
  • Need some data cleaning before processing
  • Mixes data with descriptions
  • Separate coluns for each year!

but:

What You See is What You Get

SMDX data storage and use

Of course, we do not store datasets in this format.

So, there is no reason to use them in this format.

Forget Excel files and use directly SDMX tables. It’s Easy!

  • Locations (countries, regions and etc) are stored in one single column: geo
  • Time (years, quarters, months, etc) are stored in one sigle column: time is ISO format (YYYY-MM-DD)
  • Values (of anything) are stored in one single column: values
  • Other characteristics are stored in separate columns
  • The triplet (geo, time, values) is common everywhere

That’s easy!

INSERT, DELETE, UPDATE data

  • If we have to add a new value, we have to INSERT a row
  • If we have to delete a value, we DELETE a row
  • If we have to update a value, we UPDATE a row

This rowwise procedure, very similar to relational databases and SQL is essential.

We do not perform these actions to the actual data sets, we ARE NOT the Eurostat data administrators.

But as we DO data management and analysis we have to do it sometimes.

So, in every dataset:

  • Every column is a variable
  • Every row is an observation

Let’s get the EU country names

Eurostat library provides the table (data frame) eu_countries with the EU (curently) names and ISO 3166-1 two-letter codes:

We will use info from this table in order to exclude data fro other countries in the following slides: we will focus only in the EU-28 member countries.

EU28

We will create a new tibble as follows:

Let’s view the result:

RColorBrewer

When plotting we use colors. It is a good idea to see the available color palettes, provided by the RColorBrewer library:

Explore a palette

There is also a way to examine specific colors of a palette. For example here are 5 colors of the RdBu palette:

Or, we can see the hex codes of the colors:

## [1] "#CA0020" "#F4A582" "#F7F7F7" "#92C5DE" "#0571B0"

Plot the Debt

teina225

Table teina225 http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=teina225&lang=en provides the annual general government gross debt:

## # A tibble: 768 x 7
##    unit    sector na_item geo   time         values  Year
##    <chr>   <chr>  <chr>   <chr> <date>        <dbl> <dbl>
##  1 MIO_EUR S13    GD      AT    2007-01-01  184674.  2007
##  2 MIO_EUR S13    GD      BE    2007-01-01  299997.  2007
##  3 MIO_EUR S13    GD      BG    2007-01-01    5297.  2007
##  4 MIO_EUR S13    GD      CY    2007-01-01    9462.  2007
##  5 MIO_EUR S13    GD      CZ    2007-01-01   39608.  2007
##  6 MIO_EUR S13    GD      DE    2007-01-01 1600016.  2007
##  7 MIO_EUR S13    GD      DK    2007-01-01   63755   2007
##  8 MIO_EUR S13    GD      EA18  2007-01-01 6108606.  2007
##  9 MIO_EUR S13    GD      EA19  2007-01-01 6113216.  2007
## 10 MIO_EUR S13    GD      EE    2007-01-01     595.  2007
## # … with 758 more rows

teina225 table view

For more convinient view of the data:

teina225, unit - sector - na_item

We look at the structure of the dataset:

## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  7 variables:
##  $ unit   : chr  "MIO_EUR" "MIO_EUR" "MIO_EUR" "MIO_EUR" ...
##  $ sector : chr  "S13" "S13" "S13" "S13" ...
##  $ na_item: chr  "GD" "GD" "GD" "GD" ...
##  $ geo    : chr  "AT" "BE" "BG" "CY" ...
##  $ time   : Date, format: "2007-01-01" "2007-01-01" ...
##  $ values : num  184674 299997 5297 9462 39608 ...
##  $ Year   : num  2007 2007 2007 2007 2007 ...

and it is a good idea to examine the domain of the values of the columns:

## # A tibble: 2 x 1
##   unit   
##   <chr>  
## 1 MIO_EUR
## 2 PC_GDP
## # A tibble: 1 x 1
##   sector
##   <chr> 
## 1 S13
## # A tibble: 1 x 1
##   na_item
##   <chr>  
## 1 GD

One has to refer to Eurostat’s website for more information. Clearly, if we do this a first time we have to carfully read the specifications: what exactly is the numeric value we are looking at.

For the subsequent work we will with debt as percentage of the GDP, so we will be interested in PC_GDP (unit column).

Calculation of some usefull quantities

debt of 2017 and diff from 2018

Difference of debt between 2017 and 2018

Plot of debt change (plot)

Choropleths and geofacets

Europe’s shapefiles

Shapefiles can be downloaded directly as follows.

## Simple feature collection with 28 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -63.08825 ymin: -21.39077 xmax: 55.83808 ymax: 70.08134
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##           NUTS_NAME geo                       geometry
## 1        ÖSTERREICH  AT MULTIPOLYGON (((15.75363 48...
## 2  BELGIQUE-BELGIË   BE MULTIPOLYGON (((4.75993 51....
## 3          БЪЛГАРИЯ  BG MULTIPOLYGON (((22.99717 43...
## 4            ΚΥΠΡΟΣ  CY MULTIPOLYGON (((32.27382 35...
## 5   ČESKÁ REPUBLIKA  CZ MULTIPOLYGON (((14.6188 50....
## 6       DEUTSCHLAND  DE MULTIPOLYGON (((9.1131 54.8...
## 7           DANMARK  DK MULTIPOLYGON (((10.19436 56...
## 8             EESTI  EE MULTIPOLYGON (((25.83016 59...
## 9            ΕΛΛΑΔΑ  EL MULTIPOLYGON (((26.03276 40...
## 10          ESPAÑA   ES MULTIPOLYGON (((-7.03184 43...

European Union map

EU, debt change choropleth - colorNumeric Spectral

EU, debt change choropleth - colorNumeric RdBu

EU, debt change choropleth - colorNumeric RdBu - reverse

EU, debt change choropleth - colorBin 4 - reverse

EU, debt change choropleth - colorBin 7 - reverse

EU, debt change choropleth - colorQuantile 4 - reverse

Change the palette to Quantile colors (4 quantiles) and reverse the sequense of colors.

EU, debt change choropleth - colorQuantile 5 - reverse

Change the palette to Quantile colors (5) and reverse the sequense of colors. See the different map below.

EU, debt change choropleth - colorQuantile - Set1

A qualitive palette, for example Set1 is a very bad choise, See Below.

Any remards?

28 separate line plots, facet_geo + animation

Bivariate choropleths

drawing

Growth and deficit/surplus of EU countries during 2018

Inspired by Timo Grossenbacher https://timogrossenbacher.ch/2019/04/bivariate-maps-with-ggplot2-and-sf/

Unfinished, Rosling style

maastricht

Years of crisis

GDS